{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from pandas import Series, DataFrame\n",
    "\n",
    "trip_distance = pd.read_csv('../data/taxi-distance.csv', header=None).squeeze()\n",
    "passenger_count = pd.read_csv('../data/taxi-passenger-count.csv', header=None).squeeze()\n",
    "\n",
    "df = DataFrame({'trip_distance': trip_distance,\n",
    "                'passenger_count': passenger_count})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Beyond 1\n",
    "\n",
    "If we define outliers to be the lowest 10% and highest 10% of values, then how many are they? Why is (or isn't) this a good measure?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>trip_distance</th>\n",
       "      <th>passenger_count</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.46</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>11.90</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>0.60</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>0.01</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>0.50</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9976</th>\n",
       "      <td>12.60</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9978</th>\n",
       "      <td>0.38</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9979</th>\n",
       "      <td>11.30</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9980</th>\n",
       "      <td>9.13</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9982</th>\n",
       "      <td>9.30</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>1984 rows × 2 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      trip_distance  passenger_count\n",
       "1              0.46                1\n",
       "7             11.90                4\n",
       "9              0.60                1\n",
       "10             0.01                3\n",
       "13             0.50                2\n",
       "...             ...              ...\n",
       "9976          12.60                1\n",
       "9978           0.38                1\n",
       "9979          11.30                1\n",
       "9980           9.13                1\n",
       "9982           9.30                1\n",
       "\n",
       "[1984 rows x 2 columns]"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[(df['trip_distance'] < df['trip_distance'].quantile(0.1)) | \n",
    "   (df['trip_distance'] > df['trip_distance'].quantile(0.9)) ]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The good news with this measure is that it's easy to understand. The bad news is that if we have many short trips (as we do here), we might end up calling them outliers even though they're very close in value to non-outliers."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Beyond 2\n",
    "\n",
    "How many short, medium, and long trips were there for trips that had only one passenger? Note that data for passenger count and trip length are from the same data set, meaning that the indexes are the same.\n",
    "\n",
    "If we're only interested in removing the non-outlier values, then we could use the `scipy.stats.trimboth` function on our series. It takes a second argument, the proportion we want to cut from both the top and bottom."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0.63, 0.63, 0.63, ..., 8.2 , 8.2 , 8.2 ])"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from scipy.stats import trimboth\n",
    "trimboth(df['trip_distance'], 0.1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Beyond 3\n",
    "\n",
    "The `scipy.stats.zscore` function rescales and centers (i.e., normalizes) our data set. Our mean is set to 0, values can be above and below that value. Find all of the distances for which the absolute value of the z-score is greater than 3."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "88      23.76\n",
       "238     18.32\n",
       "379     16.38\n",
       "509     16.82\n",
       "641     19.72\n",
       "        ...  \n",
       "9897    16.11\n",
       "9899    17.48\n",
       "9906    17.70\n",
       "9955    15.49\n",
       "9964    18.55\n",
       "Name: trip_distance, Length: 306, dtype: float64"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from scipy.stats import zscore\n",
    "df['trip_distance'][abs(zscore(df['trip_distance'])) > 3]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
